Web page classification using spatial information

Authors

  • Miloš Kovacevic
  • Michelangelo Diligenti
  • Marco Gori
  • Marco Maggini
  • Veljko Milutinovic
Abstract

Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a “bag of words” and then to perform additional processing on this flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Such spatial information allows the definition of heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. We show a preliminary experiment where our heuristics correctly recognize objects in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms the same classifier using only information about the content of documents.
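
As an illustration of how screen coordinates can drive area recognition, the following is a minimal sketch rather than the authors' actual heuristics. The `Box` type, the `classify_area` function, and the 20% margin band are all hypothetical choices made for this example; the abstract does not state the rules or thresholds used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of an HTML object, in browser screen pixels."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(box: Box, page_width: float, page_height: float,
                  band: float = 0.2) -> str:
    """Assign a box to a page area from the position of its centre.

    `band` is an assumed fraction of the page treated as a margin band;
    it is not a value taken from the paper.
    """
    # Normalise the centre of the box to the range [0, 1] on both axes.
    cx = (box.x + box.width / 2) / page_width
    cy = (box.y + box.height / 2) / page_height

    # Vertical bands are checked first, so a full-width bar at the top
    # counts as a header even though it also touches the side margins.
    if cy < band:
        return "header"
    if cy > 1 - band:
        return "footer"
    if cx < band:
        return "left_menu"
    if cx > 1 - band:
        return "right_menu"
    return "center"
```

The area label returned for each object could then be attached to the words it contains, so that a classifier such as Naive Bayes sees both the content and where on the page it appeared.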

Similar resources

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Visual Adjacency Multigraphs – a Novel Approach for a Web Page Classification

Standard techniques for web page classification usually take a simple text-based approach, in which most of the information provided by the visual layout of a page is discarded. In our work we propose a new classification approach based on visual layout analysis, conducted before implementing standard classification techniques. A page is represented as a hierarchical structure – Visual Ad...

Web Page Structure Enhanced Feature Selection for Classification of Web Pages

Web page classification is achieved using text classification techniques. It differs from traditional text classification because web page structure supplies additional information about content importance. HTML tags provide visual web page representation and can be considered a parameter to highlight content importance. Textual keywords...

An Improved Optimized Web Page Classification using Firefly Algorithm with NB Classifier (WPCNB)

The web is a huge repository of information which needs accurate automated classifiers for Web pages to maintain Web directories and to increase search engines' performance. In the web page classification problem, each term in each HTML/XML tag of each Web page can be taken as a feature, an efficient methods to select best features to reduce feature space of the Web page classification problem d...


Publication date: 2002